Goto

Collaborating Authors

 Dadra and Nagar Haveli and Daman and Diu


SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture

Maji, Arijit, Kumar, Raghvendra, Ghosh, Akash, Anushka, null, Saha, Sriparna

arXiv.org Artificial Intelligence

Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.


DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture

Maji, Arijit, Kumar, Raghvendra, Ghosh, Akash, Anushka, null, Shah, Nemil, Borah, Abhilekh, Shah, Vanshika, Mishra, Nishant, Saha, Sriparna

arXiv.org Artificial Intelligence

We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India's diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage amongst many more. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.


Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Raja, Rahul, Vats, Arpita

arXiv.org Artificial Intelligence

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content.We also discuss and evaluate these corpora in various terms such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.


Agricultural Landscape Understanding At Country-Scale

Dua, Radhika, Saxena, Nikita, Agarwal, Aditi, Wilson, Alex, Singh, Gaurav, Tran, Hoang, Deshpande, Ishan, Kaur, Amandeep, Aggarwal, Gaurav, Nath, Chandan, Basu, Arnab, Batchu, Vishal, Holla, Sharath, Kurle, Bindiya, Missura, Olana, Aggarwal, Rahul, Garg, Shubhika, Shah, Nishi, Singh, Avneet, Tewari, Dinesh, Dondzik, Agata, Adsul, Bharat, Sohoni, Milind, Praveen, Asim Rama, Dangi, Aaryan, Kadivar, Lisan, Abhishek, E, Sudhansu, Niranjan, Hattekar, Kamlakar, Datar, Sameer, Chaithanya, Musty Krishna, Reddy, Anumas Ranjith, Kumar, Aashish, Tirumala, Betala Laxmi, Talekar, Alok

arXiv.org Artificial Intelligence

The global food system is facing unprecedented challenges. In 2023, 2.4 billion people experienced moderate to severe food insecurity [1], a crisis precipitated by anthropogenic climate change and evolving dietary preferences. Furthermore, the food system itself significantly contributes to the climate crisis, with food loss and waste accounting for 2.4 gigatonnes of carbon dioxide equivalent emissions per year (GT CO2e/yr) [2], and the production, mismanagement, and misapplication of agricultural inputs such as fertilizers and manure generating an additional 2.5 GT CO2e/yr [3]. To sustain a projected global population of 9.6 billion by 2050, the Food and Agriculture Organization (FAO) estimates that food production must increase by at least 60% [1]. However, this also presents an opportunity: transitioning to sustainable agricultural practices can transform the sector from a net source of greenhouse gas emissions to a vital carbon sink.


Real Time Monitoring and Forecasting of COVID 19 Cases using an Adjusted Holt based Hybrid Model embedded with Wavelet based ANN

Das, Agniva, Muralidharan, Kunnummal

arXiv.org Machine Learning

Since the inception of the SARS - CoV - 2 (COVID - 19) novel coronavirus, a lot of time and effort is being allocated to estimate the trajectory and possibly, forecast with a reasonable degree of accuracy, the number of cases, recoveries, and deaths due to the same. The model proposed in this paper is a mindful step in the same direction. The primary model in question is a Hybrid Holt's Model embedded with a Wavelet-based ANN. To test its forecasting ability, we have compared three separate models, the first, being a simple ARIMA model, the second, also an ARIMA model with a wavelet-based function, and the third, being the proposed model. We have also compared the forecast accuracy of this model with that of a modern day Vanilla LSTM recurrent neural network model. We have tested the proposed model on the number of confirmed cases (daily) for the entire country as well as 6 hotspot states. We have also proposed a simple adjustment algorithm in addition to the hybrid model so that daily and/or weekly forecasts can be meted out, with respect to the entirety of the country, as well as a moving window performance metric based on out-of-sample forecasts. In order to have a more rounded approach to the analysis of COVID-19 dynamics, focus has also been given to the estimation of the Basic Reproduction Number, $R_0$ using a compartmental epidemiological model (SIR). Lastly, we have also given substantial attention to estimating the shelf-life of the proposed model. It is obvious yet noteworthy how an accurate model, in this regard, can ensure better allocation of healthcare resources, as well as, enable the government to take necessary measures ahead of time.